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Abstract 

Background: The collection of gene expression profiles from DNA microarrays and their analysis with pattern 
recognition algorithms is a powerful technology applied to several biological problems. Common pattern 
recognition systems classify samples assigning them to a set of known classes. However, in a clinical diagnostics 
setup, novel and unknown classes (new pathologies) may appear and one must be able to reject those samples 
that do not fit the trained model. The problem of implementing a rejection option in a multi-class classifier has 
not been widely addressed in the statistical literature. Gene expression profiles represent a critical case study since 
they suffer from the curse of dimensionality problem that negatively reflects on the reliability of both traditional 
rejection models and also more recent approaches such as one-class classifiers. 

Results: This paper presents a set of empirical decision rules that can be used to implement a rejection option in 
a set of multi-class classifiers widely used for the analysis of gene expression profiles. In particular, we focus on the 
classifiers implemented in the R Language and Environment for Statistical Computing (R for short in the remaining 
of this paper). The main contribution of the proposed rules is their simplicity, which enables an easy integration 
with available data analysis environments. Since in the definition of a rejection model tuning of the involved 
parameters is often a complex and delicate task, in this paper we exploit an evolutionary strategy to automate this 
process. This allows the final user to maximize the rejection accuracy with minimum manual intervention. 

Conclusions: This paper shows how the use of simple decision rules can be used to help the use of complex 
machine learning algorithms in real experimental setups. The proposed approach is almost completely automated 
and therefore a good candidate for being integrated in data analysis flows in labs where the machine learning 
expertise required to tune traditional classifiers might not be available. 



Background 

Microarrays are one of the latest breakthroughs in 
experimental molecular biology. They allow to simulta- 
neously monitoring the expression level of tens of thou- 
sands of genes. Arrays coupled with pattern recognition 
methods have been applied to studies in gene expression, 
genome mapping, transcription factor activity, toxicity, 
pathogen identification, and many other applications 
[1-10] 
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However, although in standard classification problems 
one has to classify a sample and assign it to one of a set 
of known classes, in a clinical diagnostics setup in which 
some classes (phenotypes) may be known but novel 
unknown classes (new phenotypes) may appear as well, 
one must be able to reject those samples that do not fit 
the trained model. 

In this paper, we present a set of empirical decision 
rules designed to implement a rejection option in a set 
of multi-class classifiers widely used for the analysis of 
gene expression profiles. In particular, we will focus on 
the R Language and Environment for Statistical Com- 
puting (R for short in the remaining of this paper) [11]. 
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The problem of implementing a rejection option in a 
multi-class classifier has not been widely addressed in 
the statistical literature with the exception of a few pub- 
lications [12-15]. Chow [12] put forth the decision theo- 
retic framework to rejection in pattern recognition. The 
overall idea is to estimate the class conditional probabil- 
ities for a sample and to reject it if the maximum prob- 
ability is below a given threshold. This simple rejection 
rule is optimal when the class conditional probabilities 
can be estimated without errors, which is in contrast 
with several real setups [16]. Gene expression profiles 
suffer from the curse of dimensionality problem [17] 
that negatively reflects on the reliability of probability 
estimators. The number of available classes and the cor- 
rect setup of the threshold are additional constraints 
that limit the reliability of this approach. An attempt to 
setup per-class thresholds has been proposed by Fumera 
et al. [18] to mitigate errors in probability estimation. 
However, the computational effort and the complexity 
of tuning the resulting classification system increases, 
limiting a widespread application in laboratory setups. 
Recently, one-class classifiers gained attention in the 
implementation of rejection systems in gene expression 
profiles [19-22]. These algorithms base the prediction 
model on the concept of distance among samples rather 
than on the estimation of class conditional probabilities. 
They therefore overcome the limited reliability of avail- 
able class probability estimators. However, increased 
number of classes, high dimensionality feature spaces 
such as the one of microarray datasets, noisy features, 
and quite often not enough samples, still limit their 
accuracy. 

In this paper we will build a set of rejection rules able 
to work with the very simple and often unreliable class 
probability estimators provided with the multi-class clas- 
sifiers implemented in R (see the Methods section for 
further details). The main contribution of the proposed 
rules is their simplicity. It makes possible an easy inte- 
gration with available data analysis environments while 
maintaining, at the same time, good classification per- 
formance. Since in the definition of a rejection model 
tuning of the involved parameters is often a complex 
and delicate task, in this paper we exploit an evolution- 
ary strategy to automate this process. This allows the 
final user to maximize the rejection accuracy with mini- 
mum manual intervention. 

A complete experimental setup is presented to validate 
the proposed model on a challenging data-set of blood 
diseases. A set of three multi-class classifiers widely 
adopted in the analysis of gene expression profiles 
which are also available in R has been considered. 
Results are compared to those obtained building rejec- 
tion options based on one-class classifiers [23]. Results 
show that the proposed decision rules can be efficiently 



used as a powerful rejection method, outperforming 
most of the considered one-class classifiers. 

Results and discussion 

Experimental setup 

The results of this paper have been validated on a data- 
set of gene expression profiles from complementary 
DNA (cDNA) microarrays related to very similar pheno- 
types. Only a reduced subset of genes allows for discri- 
mination (Table 1). This peculiarity increases the 
complexity of the classification allowing us to better 
validate the proposed method. It is worth mentioning 
here that, in all experiments, the training-set does not 
include any sample from the test-set. This is a given 
requirement to avoid overoptimistic results and there- 
fore to honestly evaluate the classifiers performances. 

The data-set includes a total of 7 phenotypes. Samples 
have been downloaded from the cDNA Stanford Micro- 
array database [24]. All genes without a valid UnigenelD 
have been discarded. The expression level of each gene 
is measured as the log-ratio between the Cy5 and the 

Cy5 



Cy3 channel of the array: log 2 



Cy3 



Four sets of samples have been downloaded from a 
large set of experiments aiming at performing Lym- 
phoma Classification [25,26]: 

♦ Diffuse Large B-Cell Lymphoma (DLBCL): a non- 
Hodgkin lymphoma disease, 

♦ B-Cell Chronic Lymphocytic Leukemia Wait& Watch 
(CLLww), 

♦ B-Cell Chronic Lymphocytic Leukemia (CLL), and 

♦ Follicular Lymphoma (FL): independent lymphonode 
samples on LymphoChip microarrays [27]. 

The remaining three phenotypes in the data-set are: 

♦ Acute Lymphoblastic Leukemia (ALL), 

♦ Core Binding Factor Acute Myeloid Leukemia (CBF- 
AML): subgroups characterized by shorter overall survival 
[28], 



Table 1 Structure of the considered data-set 



Blood data-set 


Phenotype acronym 


#Samples 


DLBCL 


61 


CLLww 


31 


CLL 


21 


ALL 


27 


CBF-AML 


21 


FL 


24 


AML2 


11 


Total 


196 



The table reports the acronym of each phenotype and the related number of 
samples. 
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♦ Acute Myeloid Leukemia 2 data-set (AML2): periph- 
eral-blood samples or bone marrow samples of inter- 
mediate-risk AML with a normal karyotype. 

Three multi-class classifiers often used in gene expres- 
sion profile analysis have been considered in this study: 
k-Nearest Neighbors (k-NN), feed-forward Neural 
NETwork with a single hidden layer (N-NET), and Ran- 
dom Forests (RF). All algorithms are available in R. A 
detailed description of how data have been processed 
and how the prediction models for the different classi- 
fiers have been trained is available in the Methods 
section. 

Class probability estimates analysis and decision rules 

The process of detecting samples to reject in a multi- 
class classification system can be modeled as a binary 
classification test discriminating between samples that 
belong to one of the known classes (target samples) and 
samples that do not belong to any of them (reject sam- 
ples). The outcome of the test is measured in terms of: 

♦ true positives (TP): target samples correctly accepted, 

♦ true negatives (TN): reject samples correctly rejected, 

♦ false positives (FP): reject samples erroneously 
accepted, and 

♦ false negatives (FN): target samples erroneously 
rejected. 

The number of TP, TN, FP, and FN adds up to 100% 
of the data-set. The accuracy in which target samples 
are assigned to the corresponding class is out of the 
scope of this work and depends on the accuracy of the 
selected multi-class classifier. 

The multi-class classifiers considered in this paper 
(RF, N-NET and k-NN) do not natively implement a 
rejection option. Discarding reject samples by setting a 
single threshold on the class probability estimates is 
inaccurate since class probability estimates show small 
differences between target and reject samples (refer to 
the Methods section for specific details on how class 
probability estimates have been computed). However, 
this information can still be used for discrimination if 
coupled with well tuned decision rules. 

In order to perform a preliminary qualitative analysis of 
how class probability estimates change between target 
and reject samples, we performed a set of multi-class 
classification experiments generating different splits of 
the considered data-set (in terms of targets/reject sam- 
ples and test/ training data). For each split, the multi- class 
classifiers have been trained on a subset of the considered 
phenotypes, using the remaining data as a set of samples 
to reject. Figure 1 reports, for each classifier, two density 
plots that show how the value of class probability esti- 
mates of target and reject samples distribute in the per- 
formed experiments. In the MAX plot the considered 
random variable is the highest class probability estimate 
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Figure 1 Probability density plots. Probability density plots to 
show how the value of class probability estimates of target and 
reject samples distribute over a set of experiments. In the MAX plot 
the considered random variable is the highest class probability 
estimate of each classified sample, split into target samples (solid 
line) and reject samples (dashed line). In the DIFF plot the 
considered random variable is the difference between the two 
highest probability estimates of each sample, considering target 
samples (solid line) and reject samples (dashed line). 

V ' ) 

of each classified sample, split into target samples (solid 
line) and reject samples (dashed line); in the DIFF plot 
the considered random variable is the difference between 
the two highest probability estimates of each sample, 
again considering target and reject samples. The density 
functions have been estimated from the experimental 
data by performing a Gaussian kernel density estimation 
using the density() command of R. 

Although the plots of Figure 1 may seem to suggest a 
strong overlap between the distributions of target sam- 
ples (solid lines) and reject samples (dashed lines), a cer- 
tain amount of separation is still visible. This is 
particularly evident in the case of RF, that shows a quite 
visible distinction both in the MAX and in the DIFF 
plots. In particular, in the DIFF plot of RF, target sam- 
ples (solid line) have a max around 0.8, far from the 
max of reject samples (dashed line) that falls around 0.1. 
This means that, for a target sample, the difference 
between the two top rated classes is very high (around 
0.8 in most of the cases). Instead, reject samples show a 
very low difference between probability estimates of the 
two top ranked classes, revealing the inability of the 
classifier to clearly select a target class. k-NN and N- 
NET show smaller separation; however, experimental 
results will show that a partial discrimination is still 
possible. 
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From this preliminary analysis, it seems reasonable 
that, for the three considered classifiers, the class prob- 
ability estimates provided by R could potentially be used 
for detecting reject samples. The idea exploited in this 
paper to design a set of decision rules for detecting 
reject samples is to split the MAX plot into three dis- 
tinct areas: (i) max area, (ii) decision area, and (iii) reject 
area, delimited by two rejection thresholds T max and T re j 
(J max >T re j), as shown in Figure 2 for RF. 

Figure 3 describes the overall decision process applied 
to each sample that must be classified, maxi and max 2 
denote the two highest class probability estimates for 
the considered sample. 

If the highest class probability estimate (maxi) is 
lower than T re p then the sample falls in the reject area 
and is rejected to maximize the number of TN (Rule Rl 
Figure 3, rows 1-2). Similarly, if max\ is higher than 
Tmax* the sample falls in the max area and can be 
accepted to maximize the number of TP. The class with 
probability estimate equal to max\ is predicted (Rule 
R2-Figure 3, rows 4-5). The first part of this decision 
process is very similar to the single threshold method 
proposed in [12]. 

Whenever neither Rl or R2 are satisfied, i.e., max\ 
falls between T re j and T max (decision area) there are two 
possible conditions based on the analysis of the differ- 
ence between max\ and max 2 (DIFF plot of Figure 2): 

1. if maxi - max 2 >Tdff> the sample can be accepted 
and the class with probability estimate equal to max\ is 
predicted (Rule R3.1-Alg. 3, rows 7-8). T di ff is the mini- 
mum difference between the probability estimates of the 
two top ranked classes that allows us to use maxi to 
perform a reliable classification; 



2. if max x - max 2 <^dijf > the value of max 2 is consid- 
ered. Two cases are possible: 

♦ if max 2 is higher than T re p i.e., both maxi and max 2 
fall in the decision area, the prediction is considered 
uncertain (Rule R3.2-Alg. 3, rows 10-11). In this case 
the classifier does not produce any classification result. 
This rule prevents from providing a result when the dis- 
tinction between two classes is not sufficient to correctly 
discriminate. In alternative, multiple classification results 
can be provided to alert the user that the confidence in 
the prediction is low; 

♦ if max 2 is lower than T re p the sample is rejected (Rule 
R3.3-Alg. 3, row 13). This rule mitigates the noise in 
those samples that fall at the border of the reject area. 

The three rejection thresholds (T max , T re p and Tdiff) 
can be empirically chosen analyzing the density plots of 
Figure 2: 

♦ if the MAX plot shows a clear separation between 
target and reject samples, T max can be placed in such a 
way to maximize the number of TP immediately 
detected by rule R2; 

♦ similarly to T maxi T re j can be placed to maximize the 
number of TN detected by rule Rl; 

♦ the definition of Tdiff is performed looking at the 
DIFF plot. A good heuristic is to consider the point 
where the two curves intersect. 

Manually setting the three thresholds is very complex 
and may easily lead to a high error rate. When the plots 
do not show a clear separation between target and reject 
samples, the choice of the thresholds involves a trade-off 
between increasing the sensitivity, and lowering the spe- 
cificity of the classifier. This is a complex optimization 
task. 
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Figure 2 Definition of the rejection thresholds in the case of RF. The MAX plot is split into three distinct zones: (i) max area, (ii) decision 
area, and (iii) reject area. 
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Require: max\, max 2 , T rej , T max , T dif f 

l: if maxi < T re j then 
2: return (Reject) {Rule Rl} 

3: else 

4: if maxi > T rnax then 

5: return (Target maxi ) {Rule R2} 

6: else 

7: if maxi — max2 > Tdiff then 

8: return (Target max i) {Rule R3.1} 

9: else 

10: if max2 > T re j then 

11: return (Uncertain) {Rule R3.2} 

12: else 

13: return (Reject) {Rule R3.3} 

14: end if 

15: end if 
16: end if 

17: end if 

Figure 3 Decision Rule Pseudo-code. Pseudo-code describing the decision rules able to discriminate between target and reject samples based 
on the class probability estimates of the sample and on the computed rejection thresholds. 



All thresholds must be setup only considering infor- 
mation extracted from the considered training data. To 
tackle the complexity of this process, and to allow the 
automatic tuning of all rejection parameters, a threshold 
setup algorithm based on a Covariance Matrix Adapta- 
tion Evolutionary Strategy (CMA-ES) has been devel- 
oped. The full description of this algorithm is available 
in the Methods section. 



Architecture of the multi-class classifier with rejection 
option 

The proposed decision rules can be easily integrated 
within the multi-class classification flow provided by R 
or other similar computational environments. Figure 4 
shows the computational flow of the resulting system. 
As usual when working with classification algorithms, a 
training set containing known samples is used to train 
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Figure 4 Computational flow. Computational flow to setup a multi-class classification system with a rejection option based on the proposed 
decision rules. 



the prediction model then used to classify a set of test modules required to: 1) setup the rejection thresholds, 

(unknown) samples. and 2) apply the decision rules. 

Compared to a traditional multi-class classification Setting up the rejection thresholds requires collecting 

flow, the proposed system includes two additional a statistically significant set of class probability estimates 
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for both target and reject samples on which to com- 
pute the density plots of Figure 2. At this stage, in 
which the model is trained and real reject samples are 
not available, this information can only be collected 
starting from samples in the training set by setting up 
several cross-validation experiments on different folds 
of these training data. Figure 5 outlines the way this 
module operates. Let us denote with T the full training 
set and with T t a portion of it including only those 
samples related to a specific phenotype. If k classes are 
included in T, k subsets of experiments are generated 
by iteratively excluding one of the classes T t from T to 
form a new target class T. The removed samples are 
used as a set of reject samples denoted with R (Figure 
5, rows 1-3). 

For each subset (Figure 5, row 4), m folds are gener- 
ated by removing x samples from each subclass con- 
tained in T\ and x samples from R. Each fold will 
therefore generate a test-set (Test") of x • (k - 1) target 
samples , and x reject samples. To avoid overoptimistic 
results, target samples of the test-set are removed from 
T forming a new training set Train* (Figure. 5, rows 5- 
8). Each fold is then used for an independent multi-class 
classification experiment to obtain the class probability 
estimates of each test sample in Test*. After running all 
k - m experiments, the CMA-ES analyzes the collected 
probability estimates and provides a set of optimal 



thresholds (refer to the Methods section for a complete 
description of this step). 

Validation and discussion 

The proposed rejection rules have been tested on differ- 
ent groups of experiments based on different configura- 
tions of the considered data-set in terms of target and 
reject samples. The rejection accuracy has been com- 
pared with the one of a set of selected one-class classi- 
fiers. Five one-class classification algorithms have been 
considered in this comparison: 

♦ Parzen one-class classifier (Parzen-OC), 

♦ k-NN one-class classifier (k-NN-OC), 

♦ K-Means one-class classifier (k-Means-OC), 

♦ Principal Component Analysis (PCA) one-class clas- 
sifier (PCA-OC), and 

♦ SVDD, a SVM based one-class classifier (SVM-OC). 
All one-class classifiers have been implemented and 

optimized using Matlab's DD_tools [29], a standard 
implementation used in several publications on microar- 
ray analysis [19-22]. 

Two methods of using one-class classifiers have been 
considered. Let us suppose to have a target class includ- 
ing k different subclasses (phenotypes). The one-class 
classification problem can be solved by training either k 
different one-class classifiers (one for each subclass) 
with samples rejected if rejected by all classifiers, or a 



1: for i=l to k do 

2: T ! = T — Ti {Remove samples of subclass i from the training} 

3: R = Ti {Removed samples are used as outliers} 

4: for j=l to m do 

5: Test tar get = x random samples from each class in T r 

6: Test out = x random samples from R 

7: Test* = Test target + Test out 

8: Train* = V -Test target 

9: Build multi-class classifier with Train 

10: Classication 

11: Class probability estimates computation 

12: end for 

13: end for 

14: Generate_Thresholds (Class probability estimates) 

Figure 5 Pseudo-code description of the threshold setup process. 
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single one-class classifier trained with samples of all k 
subclasses. The first approach will be referred to as 
Multi One-Class with Voting (MOCV) while the second 
approach will be referred to as Single One-Class with 
Multiple Targets (SOCMT). 

Four groups of experiments each denoted as (k s 
[3, 6]) have been generated. In the group G k the target 
class includes k out of the 7 available phenotypes. Sam- 
ples in the remaining 7 - k classes must be rejected. For 
each group different random combinations of the k 
classes included in the target profile are considered and 
for each combination several random splits of data into 
test- and training-set are generated for a total amount 
of 40 experiments for each group. For all experiments, 
the test-set includes a balanced number of target and 
reject samples. 

For each experiment data are classified with MOCV, 
SOCMT, and the three considered multi-class classifiers 
paired with the decision rules presented in this paper. 
Rejection thresholds have been automatically computed 
according to the process described in Figure 5. 

Table 2 summarizes the results of the classification. It 
provides for each classifier (rows), and for each group of 
experiments (column groups), the average sensitivity 
(Sens) and specificity (Spec) with the corresponding 
Confidence Intervals (CI) computed with 95% Level of 
Confidence (LOC). RF coupled with the decision rules is 
the classifier that globally better performs with respect 
to any other available option (its performance is high- 
lighted in bold in Table 2). The other two multi-class 
classifiers (N-NET and k-NN) are also comparable or in 
some cases better than one-class classifiers. This result 
can be better appreciated looking at Table 3 reporting 
the average accuracy improvement of the proposed 
approach over the two possible configurations of one- 
class classifiers. Accuracy is computed as the percentage 
of samples correctly classified (TP + TN) over the total 
amount of classified samples. Averages are computed 
over the 40 experiments of the corresponding group. 
Looking at the performance of RF (highlighted in bold) 
one can notice a significant improvement of the accu- 
racy in all experiments compared to one-class classifiers. 
The table also highlights how the other two multi-class 
classifiers have performances comparable to one-class 
classifiers in most of the experiments. 

A final confirmation of the improvement introduced 
by the presented approach can be appreciated in the 
Receiver Operating Characteristic (ROC) curves of Fig- 
ure 6. For each group of experiments the related ROC 
curve compares the average performance of the three 
multi-class classifiers coupled with the decision rules 
and the two best one-class classifiers (Parzen-OC and 
SVM-OC). In the case of multi-class classifiers the ROC 
curve is plotted by changing the value of the three 



rejection thresholds in order to explore as much as pos- 
sible the space of possible solutions, while, in the case 
of one-class classifiers, it is obtained by changing the 
considered rejection rate. In all experiments RF 
improves the accuracy of the one-class classifiers while 
k-NN and N-NET provide an accuracy that is compar- 
able to those of one-class classifiers. This result is 
obtained using a very simple computational model com- 
pared to the one required to setup a one-class classifica- 
tion model. 

All proposed results have been obtained by computing 
the rejection thresholds using the CMA-ES with the SSI 
objective function (see the Methods section). The dia- 
gram of Figure 7 shows the average accuracy obtained 
during the cross-validation experiment used to setup the 
rejection thresholds. Proposed results are for RF coupled 
with the rejection rules considering the different CMA- 
ES objective functions. The dashed diagonal lines repre- 
sent iso-accuracy lines, with accuracy that decreases 
from the top-left corner to the bottom-right corner. The 
graph shows that the three functions SS, SSI, and SS2 
provide the better accuracy with SSI providing the best 
results. 

The value of the objective function associated with the 
thresholds computed by the CMA-ES can be used as an 
indicator of the reliability of the trained model. When- 
ever an objective function is equal to 0, it means perfect 
discrimination among target samples and reject samples. 
Values greater than 0 indicate reduced accuracy. This is 
confirmed by the results of Table 4. It reports for each 
classifier and for each group of experiments the average 
accuracy and the average value of the SSI objective 
function associated with the computed thresholds. The 
numbers clearly show how RF, that is the one with bet- 
ter accuracy, has a lower value of the objective function 
compared k-NN and N-NET, thus suggesting a more 
reliable model. 

Conclusions 

Life sciences are undergoing a true revolution as a result 
of the emergence and growing impact of a series of new 
disciplines/tools including genomics, transcriptomics, 
proteomics and metabolomics. These new disciplines 
are devoted to the examination of the entire systems of 
genes, transcripts, proteins and metabolites present in a 
given cell or tissue type. New technologies allow now to 
collect huge amounts of data dramatically modifying the 
way scientific research is carried out. The focus is shift- 
ing from the study of "isolated realities" to the under- 
standing of whole biological systems and the 
interactions between the huge number of their indivi- 
dual components. From the beginning of this revolution, 
machine learning immediately appeared as a natural tool 
for sorting, analyzing, and extracting useful information 



Table 2 Classification performances 



G 3 G 4 G 5 G 6 



Classifier 


Sens 


CI 


Spec 


CI 


Sens 


CI 


Spec 


CI 


Sens 


CI 


Spec 


CI 


Sens 


CI 


Spec 


CI 


K-NN + rule 


0.70 


[0.61-0.79] 


0.42 
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0.70 
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0.45 
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0.55 


[0.49-0.62] 


0.59 


[0.48-0.70] 


RF + rule 


0.61 


[0.53-0.69] 


0.76 


[0.69-0.83] 


0.75 


[0.71-0.79] 


0.68 


[0.60-0.77] 


0.69 


[0.66-0.73] 


0.78 


[0.70-0.86] 


0.77 


[0.70-0.84] 


0.63 


[0.53-0.73] 


N-NET+rule 


0.38 


[0.23-0.53] 


0.68 


[0.53-0.83] 


0.80 


[0.70-0.90] 


0.50 


[0.43-0.57] 


0.57 


[0.51-0.63] 


0.71 


[0.59-0.84] 


0.81 


[0.75-0.87] 


0.34 


[0.25-0.44] 


K-NN-OCsocmt 


0.74 


[0.67-0.82] 


0.20 


[0.17-0.23] 


0.90 


[0.87-0.93] 


0.29 


[0.23-0.35] 


0.91 


[0.89-0.92] 


0.23 


[0.21-0.25] 


0.86 


[0.81-0.90] 


0.16 


[0.10-0.21] 


KMeans-OGsoc/w 


0.47 


[0.41-0.53] 


0.66 


[0.60-0.73] 


0.54 


[0.50-0.57] 


0.69 


[0.62-0.76] 


0.65 


[0.61-0.70] 


0.67 


[0.63-0.71] 


0.61 


[0.53-0.69] 


0.50 


[0.38-0.62] 


PCA-OCsocmt 


0.40 


[0.35-0.45] 


0.79 


[0.77-0.81] 


0.36 


[0.31-0.42] 


0.77 


[0.69-0.84] 


0.29 


[0.25-0.34] 


0.76 


[0.69-0.82] 


0.29 


[0.23-0.35] 


0.79 


[0.72-0.86] 


Parzen-OC SO c/M7 


0.44 


[0.39-0.50] 


0.70 


[0.64-0.76] 


0.49 


[0.46-0.52] 


0.73 


[0.67-0.79] 


0.51 


[0.47-0.54] 


0.77 


[0.70-0.83] 


0.49 


[0.45-0.52] 


0.79 


[0.70-0.87] 


SVM-OCsocmt 


0.20 


[0.16-0.24] 


0.72 


[0.68-0.77] 


0.22 


[0.20-0.25] 


0.76 


[0.70-0.81] 


0.27 


[0.24-0.30] 


0.77 


[0.70-0.83] 


0.28 


[0.25-0.31] 


0.80 


[0.72-0.88] 


s K-NN-OWi/ 


0.99 


[0.98-1.00] 


0.01 


[0.00-0.02] 


1.00 


[1.00-1.00] 


0.00 


[0.00-0.00] 


1.00 


[1.00-1.00] 


0.00 


[0.00-0.00] 


1.00 


[1.00-1.00] 


0.00 


[0.00-0.00] 


KMeans-OC MO c\/ 


0.36 


[0.29-0.43] 


0.71 


[0.63-0.80] 


0.48 


[0.43-0.54] 


0.78 


[0.72-0.83] 


0.59 


[0.57-0.61] 


0.69 


[0.64-0.73] 


0.46 


[0.44-0.49] 


0.68 


[0.60-0.77] 


PCA-OC M ocv 


0.42 


[0.36-0.48] 


0.71 


[0.65-0.77] 


0.50 


[0.48-0.52] 


0.78 


[0.72-0.83] 


0.49 


[0.46-0.53] 


0.77 


[0.70-0.83] 


0.43 


[0.40-0.45] 


0.70 


[0.60-0.80] 


Parzen-OC/Moo/ 


0.36 


[0.31-0.40] 


0.70 


[0.64-0.76] 


0.34 


[0.28-0.41] 


0.78 


[0.72-0.83] 


0.32 


[0.28-0.36] 


0.77 


[0.70-0.83] 


0.22 


[0.20-0.24] 


0.70 


[0.60-0.80] 


SVM-OWi/ 


0.19 


[0.15-0.23] 


0.76 


[0.74-0.79] 


0.25 


[0.22-0.28] 


0.79 


[0.73-0.85] 


0.27 


[0.24-0.30] 


0.78 


[0.72-0.84] 


0.22 


[0.20-0.24] 


0.72 


[0.63-0.81] 



For each group of experiments results are provided in terms of average sensitivity (Sens) and specificity (Spec) with the related Confidence Intervals (CI). Confidence intervals are computed with 95% LOC. 
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Table 3 Average improvements 

SOCMT MOCV 

K-NN-OC KMeans-OC PCA-OC Parzen-OC SVM-OC K-NN-OC KMeans-OC PCA-OC Parzen-OC SVM-OC 

G_3 K-NN+rule +0.08 +0.01 -0.01 + 0.01 + 0.12 + 0.04 + 0.05 +0.01 +0.05 +0.11 

RF+rule +0.19 +0.12 +0.10 +0.12 +0.23 +0.15 +0.16 +0.12 +0.16 +0.22 

N-NET+rule +0.03 -0.04 -0.06 -0.04 + 0.07 -0.01 + 0.00 -0.04 +0.00 +0.06 

G_4 K-NN+rule -0.01 -0.05 -0.01 -0.05 + 0.06 + 0.10 -0.07 -0.08 +0.00 +0.03 

RF+rule +0.13 +0.09 +0.13 +0.09 +0.20 +0.24 +0.07 +0.06 +0.14 +0.17 

N-NET+rule +0.06 +0.02 + 0.06 -0.04 0.13 + 0.17 + 0.00 -0.01 +0.00 0.1 

G_5 K-NN+rule -0.04 -0.16 -0.05 -0.15 -0.04 + 0.04 -0.14 -0.14 -0.06 -0.05 

RF+rule + 0.2 +0.08 +0.19 +0.09 +0.20 +0.28 +0.1 + 0.1 +0.18 +0.19 

N-NET+rule +0.11 -0.01 + 0.10 + 0.00 + 0.11 + 0.19 + 0.01 +0.01 +0.09 +0.10 

G_6 K-NN+rule +0.06 +0.01 + 0.03 -0.07 + 0.03 + 0.07 + 0.00 +0.01 +0.11 +0.10 

RF+rule +0.19 +0.14 +0.16 + 0.06 + 0.16 +0.20 +0.13 +0.14 +0.24 +0.23 

N-NET+rule +0.07 +0.02 + 0.04 + 0.01 + 0.04 + 0.08 + 0.01 +0.02 +0.19 +0.11 

Results are provided in terms of improvement of the accuracy of multi-class classifiers with decision rules versus one-class classifiers. 



from these large amounts of data. Unfortunately, some 
peculiar characteristics of biological data, such as the 
large number of variables and the low number of sam- 
ples, challenged even the most robust machine learning 
algorithms, especially when considering their use in a 
real clinical setup. This paper shows how the use of 
simple decision rules can be used to add to state-of-the- 
art multi-class classifiers a rejection option able to dis- 
card samples that do not belong to any of the trained 
classes. Traditionally, this operation is performed by 
other rejection methods, like one-class classifiers, which 
do not perform very well on microarray data. The pro- 
posed solution has several advantages: 

♦ it can be easily plugged into an environment widely 
spread in several research groups; 

♦ it is simple and does not require high computational 
resources; 

♦ in general it performs better than other available 
solutions such as those based on one-class classifiers; 

♦ it automatically tunes all parameters for the rejection 
model, requiring minimum intervention from the user. 

Methods 

Multi-class classifier setup in R 

The three considered multi-classifiers (RF, k-NN and N- 
NET) have been trained in R resorting to the Classifica- 
tion And REgression Training (CARET) package, a set 
of functions that attempt to streamline the process for 
creating predictive models in R [30]. There are many 
different modeling functions in R. Some have different 
syntax for model training and/or prediction. CARET 
provides a uniform interface to these functions allowing 
for standardization of common tasks. 

Parameter tuning of each classifier in CARET is done 
via resampling; every candidate model is evaluated many 
times using cross-validation. The number of candidate 



models is set through the setTuneLength parameter. In 
our experiments we used setTuneLength=5. This value 
represents a good compromise in terms of computa- 
tional time of the training phase and accuracy of the 
results for our experimental setup. Resampling has been 
performed according to the following parameters that 
can be set in CARET using the trainControl function: 

♦ method = "LGOCV": evaluation of each candidate 
model is performed using leave-group-out-cross- 
validation, 

♦ number = 30: number of resampling for LGOCV, 

♦ p = 0.98: percentage of samples in the training set of 
each resampling. 

Data have been pre-processed before performing clas- 
sification for normalization and dimensionality 
reduction. 

Near zero variance (NZV) features have been removed 
from the data-set resorting to the nearZeroVar() func- 
tion of CARET. The resulting data have been then pro- 
cessed using the preProcess() function of CARET. This 
function allows us to create an object able to perform 
centering, scaling and dimensionality reduction using 
PCA. Normalization is performed based on training 
data, only. 

At this stage the data-set is ready for classification. 
The classification model for each classifier is built using 
the train function in CARET, and the class probability 
estimates for each sample in the test set are computed 
resorting to the extractProb() function. The way class 
probabilities are estimated by the extractProb() function 
is classifier dependent: 

♦ k-NN performs classification based on the k closest 
training examples. Majority voting is used to predict the 
target class. The class probability estimate for a class is 
the number of training neighbors belonging to the class 
over the k considered neighbors. 
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Figure 6 ROC curves. ROC curves comparing the best one-class classifiers with the multi-class classifiers coupled with the decision rules. 



♦ RF is an ensemble classifier that consists of many 
decision trees. It predicts the class that is the mode of 
the classes predicted by individual trees. Similarly to k- 
NN the class probability estimate for a class is the num- 
ber of individual decision trees predicting the class over 
the total number of decision trees in the forest. 

♦ N-NET predictions are performed by evaluating the 
values returned by each of the output neurons (one for 
each available class). The output layer typically outputs 
the value of a sigmoid function of the linear combina- 
tion of hidden layer values representing a posterior 



probability. This value is used as class probability 
estimate. 

Threshold setup modules 

A short overview of the CMA-ES 

The covariance matrix adaptation evolution strategy 
(CMA-ES) is an optimization method first proposed by 
Hansen, Ostermeier, and Gawelczyk [31] in mid 90s, 
and further developed in subsequent years [32,33]. It is 
particularly suited for solving optimization problems 
where no preliminary hypothesis on the solution can be 
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Figure 7 CMA-ES Objective functions performances for RF. ROC graphs comparing the accuracy of RF while computing the rejection 
thresholds with different objective functions. 



derived. It is therefore a good choice for our specific 
problem, in which we want to compute the optimal 
rejection thresholds with a complete automated 
approach. In our specific application provided solutions 
are represented by a vector x = (T max , T re j, T di ff) con- 
taining the three rejection thresholds. 

In the CMA-ES, iteration steps are called generations 
due to its biological foundations. The value of a generic 
algorithm parameter y during generation g is denoted 
with y^ g \ The mean vector e R n represents the 
favorite, most-promising solution so far. The step size <7^ 
g R + controls the step length, and the covariance 
matrixQ^ e R nxn determines the shape of the distribu- 
tion ellipsoid in the search space. Its goal is, loosely 



speaking, to fit the search distribution to the contour 
lines of the objective function /to be minimized. C (0) = I 
In each generation g, X new solutions x ( ^ +1) e R n are 
generated by sampling a multi-variate normal distribu- 
tion J\f (0, C) with mean 0 (see equation 1). 



m 



} ,(cr^) 



,k = l,... l l (1) 



Where the symbol ♦ ~ ♦ denotes the same distribution 
on the left and right side. 

After the sampling phase, new solutions are evaluated 
and ranked. denotes the i th ranked solution point, 
such that fix 1: x) < ... < /(x^.J. The [i best among the X 
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Table 4 Reliability of the trained model 





G 3 






G 4 


G 5 




G 6 






acc 


f 


acc 


f 


acc 


f 


acc 


f 


RF+Rule 


0.68 


0.21 


0.72 


0.23 


0.72 


0.29 


0.70 


0.24 


N-NET+Rule 


0.54 


0.40 


0.65 


0.40 


0.65 


0.40 


0.58 


0.40 


k-NN+Rule 


0.56 


0.40 


0.58 


0.37 


0.51 


0.38 


0.57 


0.41 



The value of the objective function is used as an indicator of the reliability of 
the trained model. 



are selected and used for directing the next generation 
g + 1. First, the distribution mean is updated (see equa- 
tion 2). 



m 



+1 ) = ^ Wi x\ g \ w 1 >...>w fi > 0,^T w { = 1 (2) 



i=i 



In order to optimize its internal parameters, the 
CMA-ES tracks the so-called evolution paths, sequences 
of successive normalized steps over a number of genera- 
tions. p(p e IR n is the conjugate evolution path. 

T (n+l\ 

pW=0.yf2 v 2 n ' ~yfH + o(±) is the expectation of the 
Euclidean norm of a AT (0, 1) distributed random vec- 



tor, used to normalize paths. ju ei 



is usually 



denoted as variance effective selection mass. Let c G < 1 
be the learning rate forcumulation for the rank-one 
update of the covariance matrix; d a « 1 be the damping 
parameter for step size update. Paths are updated 
according to equations 3 and 4. 



p<* +1) = (1 - c a )pM + ^ Cff (2-c ff K ff . 



(3) 



„(«+!) =a (x) eX p 



U+i) 

P<7 



-1 



V2- 



(4) 



p(s) e jg>« i s the evolution path, p(°) = 0 • Let c c < 1 be 
the learning rate for cumulation for the rank-one update 
of the covariance matrix. Let ^ cov be parameter for weight- 
ing between rank-one and rank-^ update, and c cov < 1 be 
learning rate for the covariance matrix update. The covar- 
iance matrix C is updated (equations 5 and 6). 



p(* +1 > = (1 - C £ )p?) + Vc e (2-C e )/ieff 



,(x) 



r («) 



(5) 



C (« +1 U(i- Ccov )C^+^ 

A^cov 



+^cov I 1 ■ 



1=1 



(6) 



where OP (X) = XX T = OP(-X). 

Most noticeably, the CMA-ES requires almost no 
parameter tuning for its application. The choice of strat- 
egy internal parameters is not left to the user, and even 
X and [i default to acceptable values. Notably, the default 
population size X is comparatively small to allow for fast 
convergence. Restarts with increasing population size 
have been demonstrated [34] to be useful for improving 
the global search performance, and it is nowadays an 
option included in the standard algorithm. 
Objective functions 

Four objective functions have been evaluated in the 
optimization process, all of them trying to optimize the 
outcome of the classification process. The optimization 
process stops when it reaches one of three possible 
conditions: 

1. / reaches a predefined lower bound. This represents 
the best condition corresponding to the identification of 
the optimum solution; 

2. The value of /for the current population does not 
change more than a given value 5. The CMA-ES 
reached a local optimum that cannot be further 
improved with the current population; 

3. The value of/ for the last p populations does not 
change more than a given value a <5. Again the CMA- 
ES reached a local optimum. In this case despite the 
solution can be still slightly improved, globally, the 
increment is not significant and therefore it is not worth 
continuing the optimization. 

Sensitivity and specificity are common indicators of 
the efficiency of a binary classification test that can be 
exploited in the definition of the objective function. 
They are here computed taking into account that the 
outcome of the classification rule may also produce 
uncertain results: 



TP 



sens = ■ 



spec 



TP + FN + TR, 



TN 

TN + FP + FP„ 



(7) 



(8) 
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Based on these two indicators we tested three objec- 
tive functions defined as follows: 



SS = (1 - sens) + (1 - spec) 



SSI = 1 



sens • spec 
sens' + spec 



SS2 = 1- (sens • spec) 



(9) 



(10) 



(11) 



As required by the CMA-ES that is designed to 
minimize the objective function, greater values of sen- 
sitivity and specificity decrease the value of the objec- 
tive function. The first function considers the 
contribution of sensitivity and specificity separately, 
thus allowing for solutions where mostly only one of 
the two indicators is maximized. The second and the 
third functions try to leverage this problem by forcing 
the optimization towards results where both sensitivity 
and specificity are equally maximized. In particular 551 
seems to be the best objective function able to take 
into account the relationship between sensitivity and 
specificity 

Similarly to sensitivity and specificity, the F-Score^ is 
another statistical indicator of the outcome of a binary 
test considering the precision p and the recall r of the 
test. Again when computing p and r uncertain results 
should be considered as follows: 



P = 



TP 



TP + TP unc +FP + FP unc 



TP 



r = - 



TP + FN + TP U 



(12) 



(13) 



The F-Score^ has been exploited as objective function 
of the CMA-ES as follows: 



F -score R 



FSa =1- 



p 2 -p' + r' 



(14) 



FS1 and F50.5 come from the FSp where /? is set to 1 
and 0.5 respectively. Experimental results demonstrated 
that this function is quite inefficient since it tends to 
privilege increments of TP penalizing TN. 



List of abbreviations used 

ALL: acute lymphoblastic leukemia; AML2: acute myeloid leukemia 2; CARET: 
classification and regression training; CBF-AML: core binding factor acute 
myeloid leukemia; cDNA: complementary DNA; CI: confidence intervals; 



CMA-ES: covariance matrix adaptation evolutionary strategy; CLLww: B-cell 
chronic lymphocytic leukemia wait&watch; CLL: B-cell chronic lymphocytic 
leukemia; DNA: deoxyriboNucleic acid; DLBCL: diffuse large B-cell lymphoma; 
FL: follicular lymphoma; FP: false positives; FN: false negatives; k-Means-OC: 
k-Means one-class classifier k-NN, k-nearest neighbors; k-NN-OC: k-NN one- 
class classifier LOC, level of confidence; LGOCV: leave group out cross 
validation MOCV, multi one-class with voting; N-NET: neural network; NZV: 
near zero variance; Parzen-OC: Parzen one-class classifier PCA, principal 
component analysis; PCA-OC: PCA one-class classifier ROC, receiver operating 
characteristic; RF: random forests; Sens: sensitivity SOCMT, single one-class 
with multiple targets; Spec: specificity SVDD, support vector data description; 
SVM: support vector machine; SVM-OC: SVM one-class classifier TP, true 
positives; TN: true negatives. 
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